quickwit: add tag_fields on CounterID, drop positions on raw text #877
Closed
alexey-milovidov wants to merge 48 commits into
Conversation
Add Quickwit entry
Some historical clickhouse-cloud entries stored cluster_size as a
JSON string ("1", "2", "3") while modern ones use plain integers
(1, 2, 3). The dashboard treats the two representations as distinct
values and renders cluster_size 2 and 3 twice in selectors. Convert
all string-numeric cluster_size values to integers across the repo.
Non-numeric strings (serverless, dedicated) are left alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate-results.sh used to take, for every (system, basename), the latest dated copy across all date subdirectories. That meant the website would keep surfacing a benchmark machine even after a newer-dated re-run of the system no longer included it. Switch the rule: for each <system>/results/, find the lexicographically greatest YYYYMMDD subdirectory and emit every file it contains. Older subdirs remain in the repo as history but are not rendered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coerce cluster_size to integer
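A minimal sketch of the coercion, assuming the result files are JSON documents with a top-level `cluster_size` key (the key name comes from the message above; the glob pattern and in-place rewrite are assumptions, not the actual script):

```python
import json
from pathlib import Path

# Walk every result file and coerce string-numeric cluster_size
# values ("1", "2", "3") to integers (1, 2, 3).
for path in Path(".").glob("*/results/**/*.json"):
    doc = json.loads(path.read_text())
    size = doc.get("cluster_size")
    # Non-numeric strings like "serverless" or "dedicated" are left alone.
    if isinstance(size, str) and size.isdigit():
        doc["cluster_size"] = int(size)
        path.write_text(json.dumps(doc, indent=2) + "\n")
```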
# Conflicts:
#	data.generated.js
Use only the latest date subdir of each system for the dashboard
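A sketch of the new selection rule under the layout named above (`<system>/results/<YYYYMMDD>/`); the real generate-results.sh is a shell script, so this only illustrates the logic:

```python
from pathlib import Path

def files_to_render(system_dir: Path) -> list[Path]:
    """Emit every file from the lexicographically greatest date subdir."""
    dated = [d for d in (system_dir / "results").iterdir() if d.is_dir()]
    if not dated:
        return []
    # Zero-padded YYYYMMDD names make lexicographic order equal date order.
    newest = max(dated, key=lambda d: d.name)
    return sorted(newest.glob("*.json"))
```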
Revert #874: restore previous generate-results.sh behavior
When all selected systems return null for a query, the per-query baseline becomes Math.min() over an empty set (Infinity), which makes log(curr/Infinity) = -Infinity and collapses every system's geometric mean to 0: bars render with width 0 and the chart appears empty. Reproduction: filter to Elasticsearch + Quickwit (Q28's REGEXP_REPLACE fails on both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip queries that fail on every filtered system
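The dashboard code is JavaScript; this Python sketch shows only the guard the fix adds (the dashboard's actual scoring formula may differ):

```python
import math

def geomean_scores(times: dict[str, list]) -> dict[str, float]:
    """times[system][q] is a runtime in seconds, or None if the query failed."""
    logs: dict[str, list[float]] = {s: [] for s in times}
    n_queries = len(next(iter(times.values())))
    for q in range(n_queries):
        ok = [times[s][q] for s in times if times[s][q] is not None]
        if not ok:
            continue  # all filtered systems failed: skip, don't min() an empty set
        baseline = min(ok)
        for s in times:
            if times[s][q] is not None:
                logs[s].append(math.log(times[s][q] / baseline))
    return {s: math.exp(sum(v) / len(v)) for s, v in logs.items() if v}
```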
The c8g.metal-48xl run was committed in 2b124ba ("Update clickhouse-datalake-partitioned results", authored 2026-02-18) with the date field accidentally set to "2027-02-18". The restructure in bb91b0c then put it under results/20270218/ — making it the lexicographically-latest dir despite containing a single machine, which masked the real latest dir (20260506). Move the file to results/20260218/ alongside the other 2026-02-18 results from the same commit, and correct the date field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identify obsolete results by comparing each system's older-dated result files against the canonical pre-refactor flat layout (`bb91b0cf5~1`, the commit just before the date-subdir restructure). Any older-dated file whose basename was not in that flat layout represents a machine/configuration that had already been deleted from the canonical state — mark those `"historical"` so the dashboard doesn't surface them. This catches:

- Old ClickHouse Cloud configurations (Dedicated, colder-cache and parallel-replicas experiments, retired size tiers 40 / 56 / 80 / 128 / 240 GiB).
- Old ClickHouse hardware (c5.4xlarge, m5d.24xlarge, m6i.32xlarge, *.zstd, *.tuned, *.tuned.memory, c5n.4xlarge for clickhouse-web).
- Per-system retired runs (DataFusion `f16s_v2` and old `single.json`, MotherDuck `result.json`/`result_*`/`pulse`/`standard`, polars and polars-dataframe retired filenames, starrocks `*.untuned`, paradedb 1500GB, hydra `c6a.4xlarge` (now on `hydra.json`), etc).

Also drops tags from older results that aren't in the union of tags in the system's latest dated subdir, except `"historical"` (catches residuals like `"analytical"`, `"MySQL compatible"` in old databend, `"Python"` in arc, `"open-source"`/`"dataframe"`/`"parquet"` in polars, etc — applying the rules from earlier tag-removal commits d661b49 / 46a535b / fb09092 / ae85f0d / 0aab48e to the historical copies that still carried the deprecated tags).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
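One way the comparison against `bb91b0cf5~1` could be implemented — a sketch only; the `tags` key and the `git ls-tree` parsing are assumptions, not the actual script:

```python
import json
import subprocess
from pathlib import Path

# Basenames that existed in the flat layout just before the restructure.
flat = subprocess.run(
    ["git", "ls-tree", "-r", "--name-only", "bb91b0cf5~1"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
canonical = {Path(p).name for p in flat if p.endswith(".json")}

for f in Path(".").glob("*/results/*/*.json"):
    dates = sorted(d.name for d in f.parent.parent.iterdir() if d.is_dir())
    # Older-dated file whose basename never existed in the canonical
    # flat layout: a machine/config that had already been deleted.
    if f.parent.name != dates[-1] and f.name not in canonical:
        doc = json.loads(f.read_text())
        if "historical" not in doc.setdefault("tags", []):
            doc["tags"].append("historical")
            f.write_text(json.dumps(doc, indent=2) + "\n")
```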
Replaces the manual arch-detection + zip download with the official installer at install.gizmosql.com, mirroring the pattern DuckDB uses in this repo. The installer handles arch/OS detection and installs to ~/.local/bin by default, which we then prepend to PATH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- datafusion/results/<YYYYMMDD>/single.json renamed to c6a.4xlarge.json
to match the per-machine naming used everywhere else; the historical
tag is removed from those files since they no longer represent an
obsolete basename.
- datafusion/results/20250522/single.json deleted as redundant —
c6a.4xlarge.json already exists in the same dir with identical
metadata and identical numeric results (the only diffs are
trailing-zero formatting).
- duckdb-vortex/results/20250521/c6a.4xlarge-single.json deleted for
the same reason — same date / system / machine / metadata as the
canonical c6a.4xlarge.json next to it.
- firebolt-parquet{,-partitioned}/results/20260221/t3a.small.json
removed entirely; those entries were incorrect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
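The "identical numeric results, only trailing-zero formatting" checks above reduce to comparing the parsed documents — a sketch:

```python
import json
from pathlib import Path

def same_run(a: Path, b: Path) -> bool:
    # json.loads turns "1.50" and "1.5" into the same float, so files
    # that differ only in trailing-zero formatting compare equal.
    return json.loads(a.read_text()) == json.loads(b.read_text())
```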
Both systems had genuine standalone runs on AWS hardware that were
incorrectly tagged "historical" by the pre-refactor flat-layout
heuristic — the flat layout only kept the most recent canonical
machine per system, so older one-off machines looked obsolete even
though the run is still meaningful as a historical comparison point.
- glaredb/results/20240202/c6a.metal.json — drop historical
- hydra/results/{20221209,20230919}/c6a.4xlarge.json — drop historical
Also delete glaredb/results/20250525/c6a.4xlarge-parquet-single.json
as redundant (same date / system / machine / metadata as the canonical
c6a.4xlarge.json next to it; numerical results identical, only
trailing-zero formatting differs — same situation as the
datafusion/duckdb-vortex *-single duplicates removed in the previous
commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- motherduck/results/{20240127,20241029}/result.json renamed to
result_standard.json. The runs were originally machine="cloud"
(back when Motherduck only offered one tier); update machine to
"Motherduck: standard" to match current naming and drop the
historical tag.
- paradedb/results/20240202/c6a.4xlarge.1500gb.json deleted —
identical results to c6a.4xlarge.json next to it; the .1500gb
filename was a one-off clarification for an Elasticsearch comparison
per its comment field. The canonical c6a.4xlarge.json carries the
same numbers without that side-comment.
- paradedb/results/20240713/single.json deleted — same date / system /
machine / load_time / data_size as the canonical c6a.4xlarge.json
next to it; results differ only by tiny numerical noise (<= 0.001s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- polars/results/20241129/DataFrame_c6a.metal.json moved to
polars-dataframe/results/20241129/c6a.metal.json (the run is
system="Polars (DataFrame)", so it belongs in polars-dataframe).
- polars/results/{20241129,20241215}/parquet_c6a.metal.json /
parquet_c6a.4xlarge.json renamed to drop the parquet_ prefix
(parquet is the default encoding for polars/, so the prefix is
redundant — polars-dataframe/ is the dataframe variant).
- Historical tag dropped from all three renamed files.
The pre-existing canonical c6a.metal.json / c6a.4xlarge.json in those
date dirs were re-runs that ended up there because their date field
wasn't updated when the data was refreshed in commit 69d3e50;
the renamed files carry the actual 2024-11-29 / 2024-12-15 numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- starrocks/results/{20220715,20220925}/*.untuned.json — old untuned
variant from when both tuned and untuned runs were captured
separately. The canonical c6a.4xlarge.json / c6a.metal.json next
to them already record an untuned run (tuned="no") with the
modern schema.
- timescaledb/results/20220701/c6a.4xlarge.compression.json — old
compression-on variant; the canonical c6a.4xlarge.json carries
the standard TimescaleDB run for that date.
- trino{,-partitioned}/results/202605{06,07}/c8g*.json — c8g runs
removed entirely (per maintainer instruction).
- umbra/results/20251026/c6a.{2xlarge,xlarge}.json — incorrect
results, removed entirely.
- arc/results/2025*/m3_max*.json — m3_max runs removed entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- starrocks/results/{20220715,20220925}/{c6a.4xlarge,c6a.metal}.json
replaced with the content of the previously-deleted *.untuned.json
files. The untuned numbers are the right canonical record for those
dates (the prior "tuned" canonical was a parallel run that wasn't
the one used to establish the historical entry). Drops the
"historical" tag and the "ClickHouse derivative" tag (not in latest
starrocks tag set), keeps system="StarRocks".
- trino{,-partitioned}/results/20260507/c8g.4xlarge.json and
c8g.metal-48xl.json restored. Per maintainer note, only
c8g.24xlarge.json was supposed to be removed; the other two c8g
variants stay.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mark stale results historical to clean up the dashboard
motherduck/ uses lowercase tier names ("Motherduck: jumbo",
"Motherduck: mega", "Motherduck: standard"); pg_duckdb-motherduck/
had three files with "Motherduck: Jumbo" (capital J). Lower-case the
J so the dashboard groups all jumbo-tier runs under one machine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For cloud-service results whose .machine value contains a memory size (GB / GiB) or a T-shirt size (XS / S / M / L / XL / NXL etc), drop the redundant cloud-name prefix so the dashboard groups runs by the actual size rather than the (system, machine) tuple. The system field on each entry already carries the cloud name; repeating it inside .machine just bloats the X axis.

Also normalize T-shirt sizing variants to their letter form: Small → S, Medium → M, Large → L, X-Small → XS, X-Large → XL, 2X-Small → 2XS, 2X-Large → 2XL, 3X-Large → 3XL, 4X-Large → 4XL, 5X-Large → 5XL.

Affected systems: AlloyDB, ByteHouse, CHYT, ClickHouse Cloud (every aws/azure/gcp tier), CrunchyBridge, Databricks, Hydra, Snowflake, Supabase, Tablespace, Timescale Cloud, pgpro_tam.

Bare-metal hardware descriptions (CPU model + RAM, "AWS c5.metal 100GB", etc) are left unchanged — the rule applies to managed-cloud machine labels only. Aurora's "16acu", Hologres' "16 CU", Redshift's "ra3.4xlarge", and SingleStore's "S2"/"S24" don't match the GB or T-shirt-size pattern and are also left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
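A sketch of the relabeling rule (the prefix handling in the actual commit is per-system; the `system`/`machine` arguments here are illustrative):

```python
import re

# Spelled-out T-shirt sizes and their letter forms, per the commit above.
TSHIRT = {
    "X-Small": "XS", "Small": "S", "Medium": "M", "Large": "L",
    "X-Large": "XL", "2X-Small": "2XS", "2X-Large": "2XL",
    "3X-Large": "3XL", "4X-Large": "4XL", "5X-Large": "5XL",
}

def normalize(machine: str, system: str) -> str:
    # Drop the cloud-name prefix: "Snowflake 4X-Large" -> "4X-Large".
    m = machine.removeprefix(system).strip(" :-")
    # Longest-first so "Large" never clobbers "X-Large" / "2X-Large".
    for long, short in sorted(TSHIRT.items(), key=lambda kv: -len(kv[0])):
        m = re.sub(rf"\b{re.escape(long)}\b", short, m)
    return m or machine
```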
Normalize Motherduck Jumbo to jumbo
Convert "<digits><space?>GB" → "<digits>GiB" in cloud-service machine names. Where the value also carries an "<N> vCPU " prefix in front of the GB amount, drop that prefix — the GiB tier already conveys the size, so "8 vCPU 64 GB" simplifies to "64GiB". Examples: - "8 vCPU 64 GB" (AlloyDB) → "64GiB" - "10 vCPU 40GB" (CHYT) → "40GiB" - "720GB" (CHYT) → "720GiB" - "Analytics-256GB" (Crunchy Bridge) → "Analytics-256GiB" - "L1 - 16CPU 32GB" (Tablespace) → "L1 - 16CPU 32GiB" (16CPU is not "vCPU" so it stays) - "8 vCPU 32GB" (Timescale ☁️) → "32GiB" - "16 vCPU 32GB" / "30 vCPU 480GB" (pgpro_tam) → "32GiB" / "480GiB" - "64 vCPU 256GB" (YDB) → "256GiB" Bare-metal hardware descriptions in hardware/, versions/, gravitons/ (e.g. "AWS c5.metal 100GB", "Linode 16GB", "Steam Deck 512 GB", "AMD EPYC 3.2 GHz, Micron 5100 MAX 960 GB") are left alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize machine names: drop redundant cloud prefix, normalize T-shirt sizes
The c7i.metal-48xl runs in chdb / chdb-dataframe / chdb-parquet-partitioned were one-off captures that aren't part of the canonical machine set (the canonical chdb suite uses c6a / c7a / c8g variants). Tag them "historical" so they stop appearing on the dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"Analytics-256GiB" → "256GiB". The system field already says "Crunchy Bridge (Parquet)", so the "Analytics-" prefix is redundant once the cloud-name has been dropped from the machine label. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep only the RAM size as the machine label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize machine names
GizmoSQL: use the official one-line installer
Revert "Revert #845"
- tag_fields: [CounterID] writes per-split CounterID values into the metastore so the searcher can prune whole splits before opening them for queries 37-43, which all filter CounterID = 62 — the closest analogue to Elasticsearch's index.sort early-termination here.
- record: basic on every tokenizer: raw text field skips storing freqs and positions in the postings; phrase queries can never run against single-term raw fields, so the data was dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two index-level changes to `quickwit/index_config.yaml`, keeping the rest of the benchmark setup identical.

- `tag_fields: [CounterID]` — Q37-Q43 all filter `CounterID = 62`. Tagging it writes the per-split CounterID values into the metastore so the searcher can prune whole splits before opening them. This is the closest analogue we get to Elasticsearch's `index.sort` early-termination on the same column. Quickwit/Tantivy has no real multi-column doc-sort to match the full ES `sort.field: [CounterID, EventDate, UserID, EventTime, WatchID]`, so this picks up just the CounterID dimension.
- `record: basic` on every `tokenizer: raw` text field (28 fields). Tantivy defaults text postings to `WithFreqsAndPositions`, but raw-tokenized fields only ever hold one term per document — phrase queries can't run against them, so freqs and positions are dead weight in the index.

Validated against the running v0.9.0-nightly server (the same image `benchmark.sh` uses): the `tag_fields` and `record: basic` settings round-trip cleanly through the index-create API.
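For concreteness, a representative fragment of the resulting config — field names are taken from the ClickBench hits schema as an illustration; the real `index_config.yaml` declares 28 raw text fields:

```yaml
doc_mapping:
  tag_fields: [CounterID]     # per-split values go into the metastore
  field_mappings:
    - name: CounterID
      type: u64
      fast: true
    - name: URL               # one of the 28 tokenizer: raw text fields
      type: text
      tokenizer: raw
      record: basic           # no freqs/positions for single-term fields
```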
Test plan

- `bash benchmark.sh` end-to-end on a fresh machine
- `tag_fields` benefit
- `record: basic`

🤖 Generated with Claude Code